Determining which individuals are most susceptible to a disease allows resources to be allocated productively. For diseases such as dementia, individuals both inherit risk factors and accrue them throughout life. No factor is causative on its own, but understanding what contributes to a high risk allows the public health sector to assess and prevent potential health crises at a population level (Baumgart et al. 2015). Dementia is a clinical syndrome characterized by difficulties in memory and language, psychological and psychiatric changes, and impairments in activities of daily life (Burns and Iliffe 2009). Dementia’s complex list of possible symptoms is reflected in its causes. Common origins of dementia are degenerative neurological diseases such as Parkinson’s or Alzheimer’s; however, vascular disorders in the brain, traumatic head injuries and some infections can also lead to a dementia diagnosis.
The data used in this analysis were obtained from a longitudinal study of 150 participants. Participants were right-handed, either male or female, and aged between 60 and 96. They were characterized as nondemented, demented or converted (became demented during the course of the study). At each session, participants underwent T1-weighted MRI scans, the results of which are recorded in visit_data. Participants attended two or more sessions, each separated by at least a year.
This workflow addresses two questions: which factors are associated with an increased risk of dementia, and which factors are associated with an increased risk over time? It is important to note that no one determinant causes dementia. The profiles of two people characterized as having dementia may be completely different.
The workflow is produced with R, a statistical computing language (R Core Team 2020), and R Markdown, which generates this HTML report (Allaire et al. 2020). The bookdown package is used to add features to R Markdown such as cross-referencing (Xie 2016).
Data is imported using R, the tidyverse (Wickham et al. 2019) and readxl (Wickham and Bryan 2019) packages.
The raw data consist of two Excel sheets within the same workbook, dementia.xlsx. The first sheet, visit_data, contains information on visit numbers and MRI results. The second sheet, patient_data, contains information on current dementia status, sex, education and social status. Each row holds one patient’s data at one point in time. Subject IDs are repeated because some patients had data collected once a year over multiple years. Explanations of each column can be seen in Table 2.1.
| Term | Definition |
|---|---|
| MMSE | Mini-Mental State Examination score (range: 0 = worst to 30 = best). A 30-point questionnaire used to measure cognitive impairment. A score above 24 is considered normal. Lower scores may correlate with dementia, although this is not true in every case. |
| CDR | Clinical Dementia Rating (0 = no impairment, 0.5 = questionable, 1 = mild, 2 = moderate, 3 = severe). A clinical tool that measures the relative severity of dementia symptoms based on 6 domains (memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care). |
| eTIV | Estimated total intracranial volume (mm3) |
| nWBV | Normalized whole-brain volume (%) |
| ASF | Atlas Scaling Factor (unitless) |
| M_F | Patient sex (1 = female, 2 = male) |
| EDUC | Years of Education |
| SES | Socioeconomic status, assessed by the Hollingshead Four Factor Index of Social Status, which measures the social status of an individual based on 4 domains: marital status, retired/employed status, educational attainment, and occupational prestige. A score of 1 indicates the highest status and 5 the lowest. |
Three data sets were created from the raw data. Each one starts by merging visit_data and patient_data by subject ID. After import and merging, variable names are cleaned with the janitor package (Firke 2020). From there the data sets differ as described below:
Dementia: used to look at which factors are associated with an increased risk of dementia. Columns not used in the analysis (subject_id, visit, group and mri_number) and rows with NA values are removed. Finally, the values in m_f are converted to numerical values for analysis (F = 1, M = 2).
Dementia2: used to look at factors contributing to dementia over time. This data set was intended for use in either a paired Student’s t-test or a paired-samples Wilcoxon test. The cdr and mri_number columns and rows with NA values are removed. The rows are arranged in ascending order of visit number. The values in m_f are converted to numerical values for analysis (F = 1, M = 2). Rows with a group value of Nondemented or Demented are removed, as these groups did not change over time. Only visits 1 and 2 are kept because most subjects had no data for visit 3 or higher. The OAS2_ prefix is removed from the subject_id strings. Subject IDs that appear only once are removed because they have no pair. Finally, the visit levels are ordered, purely so that start and end appear in order in the boxplots.
Dementia_extract: used to generate some of the values used for inline reporting. In this data set all repeated subject_ids are removed so that accurate participant counts can be reported.
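The three preparation routes described above can be sketched in base R. This is a minimal illustration, not the report's own code: the toy rows are invented, the column names assume janitor-style cleaning has already been applied, and base R functions stand in for the tidyverse verbs the workflow actually uses.

```r
# Toy stand-ins for the two sheets; all values are invented for illustration.
visit_data <- data.frame(
  subject_id = c("OAS2_0001", "OAS2_0001", "OAS2_0002",
                 "OAS2_0002", "OAS2_0003", "OAS2_0004"),
  mri_number = paste0("MR", c(1, 2, 1, 2, 1, 1)),
  visit      = c(1, 2, 1, 2, 1, 1),
  mmse       = c(29, NA, 30, 27, 22, 28),
  cdr        = c(0, 0, 0, 0.5, 1, 0),
  stringsAsFactors = FALSE
)
patient_data <- data.frame(
  subject_id = c("OAS2_0001", "OAS2_0002", "OAS2_0003", "OAS2_0004"),
  group      = c("Nondemented", "Converted", "Demented", "Converted"),
  m_f        = c("M", "F", "F", "M"),
  educ       = c(14, 16, 12, 18),
  stringsAsFactors = FALSE
)

# Shared first steps: merge by subject ID and recode sex (F = 1, M = 2).
merged <- merge(visit_data, patient_data, by = "subject_id")
merged$m_f <- ifelse(merged$m_f == "F", 1, 2)

# dementia: drop unused columns, then drop rows containing NA.
unused <- c("subject_id", "visit", "group", "mri_number")
dementia <- na.omit(merged[, !(names(merged) %in% unused)])

# dementia2: converted subjects only, visits 1 and 2, complete pairs.
dementia2 <- na.omit(merged[, !(names(merged) %in% c("cdr", "mri_number"))])
dementia2 <- dementia2[order(dementia2$visit), ]
dementia2 <- dementia2[dementia2$group == "Converted" &
                         dementia2$visit %in% c(1, 2), ]
dementia2$subject_id <- sub("^OAS2_", "", dementia2$subject_id)
counts <- table(dementia2$subject_id)
paired_ids <- names(counts)[counts == 2]
dementia2 <- dementia2[dementia2$subject_id %in% paired_ids, ]
dementia2$visit <- factor(dementia2$visit, levels = c(1, 2),
                          labels = c("Start", "End"))

# dementia_extract: one row per participant, for counting people.
dementia_extract <- merged[!duplicated(merged$subject_id), ]
```

In this toy example, subject 0004 is dropped from dementia2 because it has only one visit, which mirrors the removal of unpaired subject IDs described above.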
Using the dementia data set, this section looks at which determinants correlate with a high Clinical Dementia Rating (CDR), in other words, which determinants are linked with dementia. A list and explanation of the determinants used in this analysis can be seen in Table 2.1.
Plots were generated using ggplot2 from the tidyverse package (Wickham et al. 2019). Arrangement of plots into a grid was achieved using ggarrange from the ggpubr package (Kassambara 2020).
Figure 3.1: Scatter Plots Showing Correlations Between Each Determinant And CDR
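The report does not state which correlation test accompanies these plots, so the following is only a sketch of one reasonable option: a Spearman rank correlation between each determinant and CDR, chosen here because CDR is an ordinal scale. The data are simulated, and the column names (mmse, age, cdr) are assumed to follow the cleaned variable names.

```r
# Simulated data for illustration only; not the study data.
set.seed(1)
n <- 60
toy <- data.frame(
  cdr = sample(c(0, 0.5, 1, 2), n, replace = TRUE,
               prob = c(0.55, 0.30, 0.12, 0.03))
)
toy$mmse <- pmin(30, round(29 - 6 * toy$cdr + rnorm(n, sd = 1.5)))
toy$age  <- round(runif(n, 60, 96))

# Spearman correlation of each determinant with CDR.
# exact = FALSE avoids the exact-p-value warning caused by ties in CDR.
determinants <- c("mmse", "age")
results <- do.call(rbind, lapply(determinants, function(v) {
  ct <- cor.test(toy[[v]], toy$cdr, method = "spearman", exact = FALSE)
  data.frame(determinant = v,
             rho = unname(ct$estimate),
             p_value = ct$p.value)
}))
print(results)
```

A negative rho for mmse would indicate that lower MMSE scores accompany higher CDR values; a p value above the chosen threshold, as reported for age in the discussion, indicates no significant monotonic association.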
Table generated using the kableExtra package (Zhu 2020).
| Determinant | CDR | Mean | N | Standard Deviation | Standard Error | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| Age | 0.0 | 77.1553398 | 206 | 8.0894478 | 0.5636185 | 60.000 | 97.000 |
| Age | 0.5 | 77.4363636 | 110 | 7.3015359 | 0.6961741 | 62.000 | 92.000 |
| Age | 1.0 | 74.3714286 | 35 | 6.8645968 | 1.1603286 | 61.000 | 96.000 |
| Age | 2.0 | 85.0000000 | 3 | 11.2694277 | 6.5064071 | 78.000 | 98.000 |
| MMSE | 0.0 | 29.2233010 | 206 | 0.9205729 | 0.0641394 | 25.000 | 30.000 |
| MMSE | 0.5 | 26.4636364 | 110 | 3.0400304 | 0.2898555 | 17.000 | 30.000 |
| MMSE | 1.0 | 20.3142857 | 35 | 5.2735267 | 0.8913887 | 4.000 | 30.000 |
| MMSE | 2.0 | 20.3333333 | 3 | 5.0332230 | 2.9059326 | 15.000 | 25.000 |
| eTIV | 0.0 | 1486.8592233 | 206 | 179.9986303 | 12.5410988 | 1106.000 | 2004.000 |
| eTIV | 0.5 | 1482.4545455 | 110 | 174.0359889 | 16.5936805 | 1143.000 | 1928.000 |
| eTIV | 1.0 | 1528.0000000 | 35 | 157.8443015 | 26.6805566 | 1274.000 | 1957.000 |
| eTIV | 2.0 | 1538.0000000 | 3 | 157.4452286 | 90.9010451 | 1401.000 | 1710.000 |
| nWBV | 0.0 | 0.7404515 | 206 | 0.0373497 | 0.0026023 | 0.644 | 0.837 |
| nWBV | 0.5 | 0.7205182 | 110 | 0.0345072 | 0.0032901 | 0.646 | 0.806 |
| nWBV | 1.0 | 0.6990571 | 35 | 0.0224564 | 0.0037958 | 0.657 | 0.756 |
| nWBV | 2.0 | 0.7066667 | 3 | 0.0503322 | 0.0290593 | 0.660 | 0.760 |
| ASF | 0.0 | 1.1971068 | 206 | 0.1405721 | 0.0097941 | 0.876 | 1.587 |
| ASF | 0.5 | 1.1995091 | 110 | 0.1365395 | 0.0130185 | 0.910 | 1.535 |
| ASF | 1.0 | 1.1600286 | 35 | 0.1146708 | 0.0193829 | 0.897 | 1.377 |
| ASF | 2.0 | 1.1490000 | 3 | 0.1146865 | 0.0662143 | 1.026 | 1.253 |
| EDUC | 0.0 | 15.1601942 | 206 | 2.7047506 | 0.1884489 | 8.000 | 23.000 |
| EDUC | 0.5 | 14.0090909 | 110 | 3.1781809 | 0.3030277 | 6.000 | 20.000 |
| EDUC | 1.0 | 14.0000000 | 35 | 2.4970571 | 0.4220797 | 8.000 | 20.000 |
| EDUC | 2.0 | 17.0000000 | 3 | 3.0000000 | 1.7320508 | 14.000 | 20.000 |
| SES | 0.0 | 2.3349515 | 206 | 1.0497116 | 0.0731369 | 1.000 | 5.000 |
| SES | 0.5 | 2.6818182 | 110 | 1.2186006 | 0.1161890 | 1.000 | 5.000 |
| SES | 1.0 | 2.5714286 | 35 | 1.2434703 | 0.2101848 | 1.000 | 5.000 |
| SES | 2.0 | 1.6666667 | 3 | 1.1547005 | 0.6666667 | 1.000 | 3.000 |
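The columns of the table above can be reproduced per determinant with a small helper. This is a sketch in base R rather than the report's own code; summarise_determinant is a hypothetical name, and the standard error is taken to be the standard deviation divided by the square root of N, which agrees with the values in the table.

```r
# Hypothetical helper: summary statistics of one determinant per CDR level.
summarise_determinant <- function(values, groups) {
  stats <- lapply(split(values, groups), function(x) {
    n <- length(x)
    data.frame(mean = mean(x), n = n, sd = sd(x),
               se = sd(x) / sqrt(n),      # standard error of the mean
               min = min(x), max = max(x))
  })
  data.frame(cdr = names(stats), do.call(rbind, stats), row.names = NULL)
}

# Invented ages for two CDR levels.
age <- c(70, 80, 90, 60, 64)
cdr <- c(0, 0, 0, 0.5, 0.5)
age_summary <- summarise_determinant(age, cdr)
print(age_summary)

# Consistency check against the first Age row of the table:
# SD 8.0894478 with N = 206 gives a standard error of about 0.5636.
reported_se <- 8.0894478 / sqrt(206)
```

Because the standard error scales with 1/sqrt(N), the CDR 2.0 rows (N = 3) carry much larger standard errors than the other rows, which is worth keeping in mind when comparing their means.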
Plots were generated using ggplot2 from the tidyverse package (Wickham et al. 2019). Arrangement of plots into a grid was achieved using ggarrange from the ggpubr package (Kassambara 2020).
Figure 4.1: Boxplots Showing Determinant Data At The Start And End Of The Study In Converted Patients.
| Determinant | Date | Mean | N | Standard Deviation | Standard Error | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| Age | Start | 76.1666667 | 12 | 7.7440104 | 2.2355032 | 65.000 | 86.000 |
| Age | End | 78.8333333 | 12 | 7.0817734 | 2.0443319 | 67.000 | 88.000 |
| MMSE | Start | 29.3333333 | 12 | 0.9847319 | 0.2842676 | 27.000 | 30.000 |
| MMSE | End | 28.0000000 | 12 | 2.0889319 | 0.6030227 | 24.000 | 30.000 |
| eTIV | Start | 1437.9166667 | 12 | 143.7722051 | 41.5034607 | 1264.000 | 1704.000 |
| eTIV | End | 1446.2500000 | 12 | 150.2864689 | 43.3839666 | 1275.000 | 1722.000 |
| nWBV | Start | 0.7402500 | 12 | 0.0350691 | 0.0101236 | 0.693 | 0.799 |
| nWBV | End | 0.7284167 | 12 | 0.0361398 | 0.0104327 | 0.677 | 0.788 |
| ASF | Start | 1.2311667 | 12 | 0.1176001 | 0.0339482 | 1.030 | 1.388 |
| ASF | End | 1.2250000 | 12 | 0.1218345 | 0.0351706 | 1.019 | 1.376 |
| EDUC | Start | 15.5000000 | 12 | 2.5761141 | 0.7436601 | 12.000 | 20.000 |
| EDUC | End | 15.5000000 | 12 | 2.5761141 | 0.7436601 | 12.000 | 20.000 |
| SES | Start | 1.8333333 | 12 | 1.0298573 | 0.2972942 | 1.000 | 4.000 |
| SES | End | 1.8333333 | 12 | 1.0298573 | 0.2972942 | 1.000 | 4.000 |
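The dementia2 data set was prepared for either a paired Student's t-test or a paired-samples Wilcoxon test. How the choice between the two was made is not stated, so the sketch below assumes a common convention: check the paired differences with a Shapiro-Wilk test and fall back to the Wilcoxon test when normality is rejected. The function name compare_paired and the simulated nWBV-like values are illustrative only.

```r
# Hypothetical helper: choose a paired test from the normality of differences.
compare_paired <- function(start, end, alpha = 0.05) {
  diffs <- end - start
  # Determinants that do not change have no variation to test.
  if (all(diffs == 0)) {
    return(list(method = "none (no change between visits)",
                p_value = NA_real_))
  }
  if (shapiro.test(diffs)$p.value > alpha) {
    test <- t.test(end, start, paired = TRUE)
  } else {
    test <- wilcox.test(end, start, paired = TRUE, exact = FALSE)
  }
  list(method = test$method, p_value = test$p.value)
}

# Simulated start and end values for 12 converted subjects.
set.seed(42)
start <- round(rnorm(12, mean = 0.74, sd = 0.035), 3)
end   <- round(start - abs(rnorm(12, mean = 0.012, sd = 0.005)), 3)

result <- compare_paired(start, end)
print(result)

# Identical start and end values trigger the guard.
unchanged <- compare_paired(c(12, 16, 20), c(12, 16, 20))
```

The guard matters for this table: EDUC and SES have identical start and end values, so their paired differences are all zero and neither test is informative for them.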
In addition, an LDA model and a questionnaire whose responses are fed into the model can be found here: dementia_grouping_questionnaire.Rmd. The additional packages used there are caret (Kuhn 2020), MASS (Venables and Ripley 2002), shiny (Chang et al. 2020) and shinyforms (Attali, n.d.). An explanation of how each package is used can be found in the linked Rmd file. The model is trained to predict dementia grouping (demented or nondemented).
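For orientation, a linear discriminant analysis of this kind can be fitted with lda from MASS. The sketch below is not the model in the linked Rmd file: the data are simulated, only two predictors (mmse and nwbv) are used, the 80/20 split is an arbitrary choice, and caret, shiny and shinyforms are left out.

```r
library(MASS)

# Simulated training data; values are invented for illustration.
set.seed(7)
n <- 100
toy <- data.frame(
  group = factor(rep(c("Nondemented", "Demented"), each = n / 2)),
  mmse  = c(rnorm(n / 2, mean = 29, sd = 1),
            rnorm(n / 2, mean = 24, sd = 3)),
  nwbv  = c(rnorm(n / 2, mean = 0.74, sd = 0.035),
            rnorm(n / 2, mean = 0.71, sd = 0.035))
)

# Hold out 20% of rows to estimate accuracy.
train_idx <- sample(seq_len(n), size = 0.8 * n)
fit <- lda(group ~ mmse + nwbv, data = toy[train_idx, ])

pred <- predict(fit, newdata = toy[-train_idx, ])$class
accuracy <- mean(pred == toy$group[-train_idx])

# A single new response, as a questionnaire might supply.
new_case <- data.frame(mmse = 22, nwbv = 0.70)
new_class <- predict(fit, newdata = new_case)$class
```

Here new_case plays the role of one set of questionnaire answers passed to the fitted model to obtain a predicted grouping.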
While this data set is insightful for a wide range of medical and social determinants, it is limited in several ways. This is partly due to its longitudinal nature: data of this kind is time-consuming to gather and willing participants are hard to recruit. As a result, participants share a willingness to take part in a long study, which reduces generalizability to the whole population. All participants were right-handed, and there is conflicting evidence as to whether this increases (Ryan, Kreiner, and Paolo 2020) or decreases (Leon et al. 1986) the incidence of dementia onset caused by Alzheimer’s disease. Either way, this also reduces generalizability.
The data set included 3 observations with a CDR of 2.0, none with a higher CDR and 351 with a CDR below 2.0. Issues with consent may make it harder to recruit participants with moderate to severe dementia. This led to some unexpected results: increased age was not shown to significantly increase CDR (p value = 0.44), which contradicts studies showing that risk increases exponentially with age up to 90 (Jorm and Jolley 1998). Table 3.1 shows a higher mean age in the CDR 2.0 bracket, but the small sample size means this is not visible in Figure 3.1.
Patient data could be expanded to include other determinants or to break down the determinants used in this analysis. For example, it has been speculated that the risk associated with age is due to related factors such as higher blood pressure, changes to cell structure or the weakening of the body’s repair systems. This workflow could be altered to include these by changing the columns used where necessary (statistics, plots, summary tables, data tidying). Sample size also affected the dementia2 analysis because the converted sample was small (24). To work around this, visit data was reduced to two levels (start and end) rather than multiple levels (visit 1, 2, 3, etc.). So while the analysis identifies determinants associated with an increase over time, it does not specify any length of time; this is another area that could be examined further with more data.
Overall, the longitudinal design allowed temporal aspects of dementia onset to be considered but limited the amount of data collected.
Word count is calculated using wordcountaddin (Marwick 2020).
This rmd script: 1245
The dementia grouping questionnaire script: 298
The README: 295
Total: 1838